Introduction

Crucial to all good statistical analyses is a thorough exploratory analysis. In this document, we will introduce a few techniques for developing an understanding of our dataset, including an understanding of its limitations.

Load packages

Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.

library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)

File set up

If you have not done so already, you need to download the two set of files associated with the workshop.

All the workshop script and tutorial files can be found on Github at github.com/joenomiddlename/DFO_SDM_Workshop_2020. To download these files on your computer, press on Code and then Download ZIP. Once downloaded, unzip the DFO_SDM_Workshop_2020-main.zip file. To be able to access some of the precompiled data, we need to make the unzipped folder the working directory. To do so in R Studio, navigate to the correct folder using the bottom right panel (folder view, you can use button) and open it. Then, click “Set as Working Directory” under the tab ‘More’.

We should be inside the folder titled: ‘DFO_SDM_Workshop_2020’, you can verify this by using getwd().

Now we can load in the precompiled data

list2env(readRDS('./Data/Compiled_Data_new.rds'), globalenv())
## <environment: R_GlobalEnv>

Fin whale data

Throughout this workshop, we use data collected on fin whales in June, July, and August across the years 2007-2009 and for 2011. There are two main types of data:

  • Sightings from systematic aerial surveys, found in Sightings_survey. The sightings information contain counts of fin whales, although for this tutorial we simplify the analysis and only model presence/absence. These systematic survey also contain information such as the distance of the sightings from the plane. Aerial survey sightings are accompanied with the aerial tracklines, found in Effort_survey. In these data objects, we combined three surveys conducted by both DFO and NOAA: 2007 T-NASS Aerial Survey (Lawson et al. 2009), NOAA NARW Surveys (Cole, NOAA), and NOAA Cetacean Surveys (Cole et al., NOAA).

  • Opportunistic fin whale sightings from two whale-watching operators, found in Sightings_DRWW_sp. This dataset is not accompanied by tracklines. Again, these datasets contains whale counts, but for simplicity we will analyze them as presence/absence data in these tutorials. These sightings are maintained in the DFO Maritimes Region Whale Sightings Database (MacDonald et. al. 2017).

  • Distance from port for both the Quoddy and Brier whale watch vessels. These covariates will be used to model effort from the whale watch vessels.

  • Ocean depth data, found in Bathym. Bathymetry data will be our main covariate in our analysis and we will explore whether the distribution of fin whales is linked to bathymetry.

Coordinate reference system

These data are spatially-referenced data. The first thing we must always do is check for consistency between the coordinate reference systems (CRS) of each spatial object in use! To print the CRS of an sp object, simply add @proj4string to the name of the spatial object and run.

Sightings_DRWW_sp@proj4string 
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Sightings_survey@proj4string 
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Effort_survey@proj4string 
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Domain@proj4string 
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Bathym@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
WW_ports@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Brier@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Quoddy@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

All of the spatial objects are in lat/lon - good! For future analysis we will be projecting the data into a different coordinate reference system to better preserve euclidean distance.

Finally, let’s turn off all warnings associated with coordinate reference systems. The latest PROJ6+/GDAL3+ updates have caused many warning messages to be printed.

rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")

Plotting the data

Our first goal is generally to plot the data on a map. The gg() and gmap() functions from the inlabru package is extremely useful at plotting spatial data! The class of spatial objects we will use are from the sp package. The classes of these objects begin with ‘Spatial’. For example SpatialPointsDataFrame.

We have written a bespoke function gg.spatiallines_mod() to easily add SpatialLinesDataFrame objects to the plots too. This will prove useful for plotting transect lines. We load the bespoke functions to the working environment now.

source('utility_functions.R')

Let’s plot our data!

If the data are in lat/lon format then the gmap() function will automatically add a terrain layer to the plots. Let’s plot the survey sightings in blue, the survey tracklines in black, the whale-watch sightings in purple, the whale watch ports in red.

gmap(Sightings_survey) +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')

Some of the maps used by default in gmap are copyrighted (e.g., Google maps, see ?get_map), see Additional tips for ways to use open-source maps for publication.

To remove the map layer, simply replace the gmap(Sightings_survey) with ggplot().

ggplot() + # Notice the empty ggplot() call
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')

This plot hides some crucial information regarding the data collection. For example, the survey sightings and tracklines do not come from a single survey, or even a single organisation! Let’s plot this! The easiest way to do this is to subset the data accordingly!

table(Effort_survey$DATASET)
## 
##    DFO NOAA_1 NOAA_2 
##     49     54     70
# there are 3 surveys
ggplot() +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='DFO',], colour='purple') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_1',], colour='red') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_2',], colour='yellow')

This is problematic! The DFO tracklines (in purple) do not overlap with the two NOAA surveys! Thus, any future model will be unable to identify any differences in protocol efficiency. This is because any model intercepts will be confounded with the latent spatial field. More on this later!

Exercise 1

In addition, the surveys were conducted across 4 separate years. Let’s plot the survey tracklines by year. Do you see any cause for concern? Note that the names of the variables in the Effort_Survey are: DATASET and YEAR. Try this on your own! If you get stuck, click ‘Show Code’. Hint: The YEAR variable is of type character and contains 4 unique values (see below).

table(Effort_survey$YEAR)
## 
## 2007 2008 2009 2011 
##  101   20   19   33
class(Effort_survey$YEAR)
## [1] "character"
ggplot() +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2007',], colour='purple') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2008',], colour='red') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2009',], colour='blue') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2011',], colour='yellow')

Note that the surveys from years 2007, 2008 and 2009 covered largely different regions! Again, this is problematic if we want to model any changes in the whale distribution over time! The effect of year will be confounded by the spatial field. That being said, the data from 2011 appear to be a good candidate for model comparison as the spatial range overlaps with the other 3 years’ effort. We will holdout this data as our test data and use 2007, 2008, and 2009 as our training data.

xtabs(~ YEAR + DATASET, data=Effort_survey@data)
##       DATASET
## YEAR   DFO NOAA_1 NOAA_2
##   2007  49      3     49
##   2008   0     20      0
##   2009   0     19      0
##   2011   0     12     21

2011’s data comes exclusively from NOAA.

Transforming the data into a new CRS

For modelling, we will transform the data from lat/lon into a new “Canadian” CRS: NAD83 datum with UTM Zone 20N projection. This projection will help to preserve euclidean distance between points. We define the CRS object for this projection with EPSG code 2961:

Can_proj <- CRS("+init=EPSG:2961")
Can_proj <- fm_crs_set_lengthunit(Can_proj, unit='km')

The second line of code specifies that we want to work in units of km instead of the default meters. This can prove vital in applications to avoid numerical overflow.

Transforming spatial points, lines, and polygons

To do the transformation, we will use the spTransform() function. For example, we transform the whale-watch sightings spatial object Sightings_DRWW_sp as follow:

Sightings_DRWW_sp <- spTransform(Sightings_DRWW_sp, Can_proj)
Sightings_DRWW_sp@proj4string
## CRS arguments:
##  +proj=tmerc +lat_0=0 +lon_0=-63 +k=0.9996 +x_0=500000 +y_0=0
## +ellps=GRS80 +units=km +no_defs

Notice the changed output from calling @proj4string.

Exercise 2

Please repeat this for all the spatial objects that are points, lines or polygons.

Sightings_survey <- spTransform(Sightings_survey, Can_proj)
Effort_survey <- spTransform(Effort_survey, Can_proj)
WW_ports <- spTransform(WW_ports, Can_proj)
Domain <- spTransform(Domain, Can_proj)

Transforming raster-like spatial pixels objects

Transforming the ‘raster’-like SpatialPixelsDataFrame objects (Bathym, Dist_Brier, and Dist_Quoddy) using spTransform would be inappropriate here. The projection leads to a curvature of the pixels. A more appropriate approach here is to use bilinear interpolation. The projectRaster() function from the raster package works great for this. This requires converting the SpatialPixelsDataFrame object into an object of type raster. This is made easy with the function raster(). Finally, to convert the raster object back into a SpatialPixelsDataFrame, we can use the as() function from the maptools package. This function is extremely useful for converting spatial objects between the popular packages: sp, spatstat, and sf. We use this function substantially throughout the workshop.

Bathym <- as(projectRaster(raster(Bathym), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Brier <- as(projectRaster(raster(Dist_Brier), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Quoddy <- as(projectRaster(raster(Dist_Quoddy), crs=Can_proj), 'SpatialPixelsDataFrame')

Plot the (transformed) Bathymetry and Distance from Port spatial objects. We are going to combine these into a single plot using the multiplot() function from the inlabru package. This function takes as input ggplot objects and an argument layout, specifying how the plots should be arranged (see Additional tips for ways to change the layout).

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') + 
    coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)')+ 
  coord_fixed(ratio = 1),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))

Don’t like the colour scheme? See Additional tips to learn how to define your own manually.

Simplifying the coastline (domain)

JOE: explain in detail how to change the domain for the creating the mesh and the covariates. I understand that this is annoying because the model fitting is hard, so if we can’t get that, then fine. But we need to simplify this in some way without hiding the important parts. What I am asking is that if you change something else that would affect the model fit and will have to redo model fitting anyway, please do my suggestion. If not then, find a compromise solution. Talk to me about your solution first, so we decide together what’s the best course of action before you do a lot of work.

The Domain in its current form has a very complex coastline. In the second and third workshops, inlabru will require a triangulation mesh to be created that covers the Domain. Unfortunately, creating a triangulation mesh over a ‘wiggly’ coastline is very challenging and numerous issues can arise. To combat these issues, we will smooth the Domain using the functions gSimplify() and gBuffer(). gSimplify() attempts to reduce the number of segments used to define the SpatialPolygonsDataFrame. The amount of reduction is determined by the argument tol=. gBuffer() will extend (or buffer) the boundary of the SpatialPolygonsDataFrame by an amount determined by the argument width=. When used together, these two functions can help to define smoothed and simplified boundaries that completely contain the original complex Domain.

However, we should repeat the process once more for the covariates. inlabru requires that every covariate is defined at every point within the computational mesh. Because simplification of the boundary can occur when the mesh is created, it can help to define a second extended and simplified domain for defining the covariates over. By ensuring that this second extended domain is slightly larger than the first extended domain, we can ensure that a covariate value is properly defined at every value in the computational mesh. We create these two domains below:

Domain_restricted <- gSimplify(gBuffer(Domain,width=15),tol=20)
# Plot
ggplot() +
  gg(Domain_restricted) +
  gg(Domain, color='red') +
  gg.spatiallines_mod(Effort_survey, color='blue')

# Does the new domain contain all the survey tracklines?
gContains(Domain_restricted,Effort_survey)
## [1] TRUE
# Create a new domain that is slightly larger - for defining covariates later
Domain_restricted2 <- gSimplify(gBuffer(Domain,width=30),tol=20)
# Plot
ggplot() +
  gg(Domain_restricted) +
  gg(Domain_restricted2, color='green') +
  gg(Domain, color='red')

# Does it contain the old 
gContains(Domain_restricted2,Domain_restricted)
## [1] TRUE

Great! We have defined two new buffered and smoothed domains. The original domain (Domain) is shown in red and is where the computation will take place. The domain over which the computational mesh will be defined (Domain_restricted) is shown in black. Lastly, the domain over which the covariates will be defined (Domain_restricted2) is shown in green.

Once again, the smoothing of the coastline helps us to define a computational mesh that has fewer numerical issues. All computations will still take place over the original domain (Domain). More on that later!

Extending the covariates to match the new domain

The package inlabru will require that our covariates are defined at every point in Domain_restricted. To guarantee this, we will now extend our covariates to take ‘sensible’ values at all points in Domain_restricted2. These imputed values will not affect the model estimates as the model will be fit to the original Domain.

Our covariate depth is well-defined on the original (un-buffered) domain as defined by the SpatialPolygons Domain. At values outside this, we choose to ‘fill-in’ these points with nearest neighbour imputations. The code below does this for depth.

JOE: see my comment below with regards to the bathymetry and removing values that are on land. Changing the values of the bathymetry (as in removing the pixels on land and explaining why log_depth and why log(1-log(depth)) was actually the thing I was suggesting to move here, not the restricting the covariate values to the mesh. I think the stuff associated with the mesh makes more sense in the next tutorial. What I’m envisioning is actually removing the values above 0 in the file that is save in the data file Compiled_Data_new.rds, make them NAs. And keep the explanation of log(1 - depth) here, since I think that’s what you use in the Point process model below anyway and you can talk about why the transformation of the variable makes sense for the Point process model and for the other analysis in one go. Keep the crop to mesh size in tutorial 2 after you talk about the mesh. See the two comments below.

## Redefine log_Depth across our modified domain
# Create the log depth and log slope covariates
log_Depth <- Bathym
log_Depth$FIWH_MAR_Bathymetry[log_Depth$FIWH_MAR_Bathymetry >= 0] <- 0
log_Depth$FIWH_MAR_Bathymetry <- log(1-log_Depth$FIWH_MAR_Bathymetry)
names(log_Depth) <- 'log_Depth' 

# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]

# 2) Extract values of the covariate at the new pixel locations
pixels_Domain$log_Depth <- over(pixels_Domain,log_Depth)$log_Depth
# 3) impute missing values with the nearest neighbour value
pixels_Domain$log_Depth[is.na(pixels_Domain$log_Depth)] <- 
  log_Depth$log_Depth[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Depth)),]),'ppp'),
                          as(SpatialPoints(log_Depth@coords),'ppp'),
                          what = 'which')]
log_Depth <- pixels_Domain
# 4) Create a squared log depth covariate (for later)
log_Depth_sq <- log_Depth
names(log_Depth_sq) <- 'log_Depth_sq'
log_Depth_sq$log_Depth_sq <- log_Depth_sq$log_Depth_sq^2

# Plot the covariates with Domain_restricted overlayed
ggplot() + gg(log_Depth) + gg(Domain_restricted)

We also need to buffer and rescale the distance from port covariates Dist_Brier and Dist_Quoddy. Unlike with previous covariates, however, we will fix all buffered values on land equal to a large constant. This will help to ensure that negligible effort is recorded from the whale watch vessels on land through \(\lambda_{eff}\).

We show how to do it with the distance to Brier.

# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]

# Extract the average value of distance at the newly created pixel locations
pixels_Domain$Dist_Brier <- over(pixels_Domain,Dist_Brier)$Dist_Brier
# There are some missing values due to newly created pixels being on land
# Fill in missing values with a very large value (1000km).
# This will make the effect of these regions negligible on inference as lambda_eff will be small
pixels_Domain$Dist_Brier[is.na(pixels_Domain$Dist_Brier)] <- 1e3

Dist_Brier <- pixels_Domain
ggplot() + gg(Dist_Brier) + gg(Domain)

# There is an infinity value at the port. Change to 0
Dist_Brier$Dist_Brier[is.infinite(Dist_Brier$Dist_Brier)] <- 0
max(Dist_Brier$Dist_Brier)
## [1] 1000
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Brier$Dist_Brier <- Dist_Brier$Dist_Brier / 980.7996

Exercise

Now, please rescale and buffer the distance to Quoddy object.

pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]

pixels_Domain$Dist_Quoddy <- over(pixels_Domain,Dist_Quoddy)$Dist_Quoddy
pixels_Domain$Dist_Quoddy[is.na(pixels_Domain$Dist_Quoddy)] <- 1e3

Dist_Quoddy <- pixels_Domain
# There is an infinity value at the port. Change to 0
Dist_Quoddy$Dist_Quoddy[is.infinite(Dist_Quoddy$Dist_Quoddy)] <- 0
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Quoddy$Dist_Quoddy <- Dist_Quoddy$Dist_Quoddy / 980.7996
ggplot() + gg(Dist_Quoddy) + gg(Domain)

Exploring detection function and effort layers of survey data

JOE: this was below in the ppm stuff and it was a bit confusing when lumped with the survey distance trackline, at first sight it felt like the histogram was for the nearest distance you calculated withthe trackline. Since in lecture 1 you talk about detection function in general, it makes sense to look at it in a seperate section.

To explore which detection function (\(p(d)\)) that should be used with the survey data, we will first plot the histogram of measured distances from the plane (i.e. the observed distances of sightings from the trackline).

# What is the maximum detection distance?
ggplot(Sightings_survey@data, aes(x=DISTANCE)) + geom_histogram()

The histogram of the perpendicular distances from the aircraft shows an apparent maximum detection probability at a value of distance above 0! Low detection probability close to 0 is common for aerial surveys since it is harder to see below the plane. Given the limited size of the survey data and the potential numerical issues we may face fitting a complex detection probability function, we will fix all values less than 250m equal to 250m. This crude approach will help to make the histogram appear as a monotonically decreasing function of distance, satisfying the simplest class of detection probability functions. While we will use this modified distance data throughout the tutorials, it should not be done in practice. When you have enough data, a 2-parameter detection probability function should be used instead.

In addition, it is often advised in distance sampling applications to threshold our upper detection distances at a ‘sensible’ value. For the histogram above, it looks like 2km could be a reasonable threshold value to choose. Let’s set all values above 2km equal to 2km and scale the distances onto the km units scale. This rescaling to units of km is crucial for numerical stability. We define a variable ‘distance’ that contains the rescaled distances. Note that a few distances are missing. We impute these with the mean distance.

# Setting the distances <250 to 250
Sightings_survey@data$DISTANCE <- ifelse(Sightings_survey@data$DISTANCE>250,Sightings_survey@data$DISTANCE,250)

# Setting the distances >2000 to 2000
Sightings_survey@data$DISTANCE <- 
  ifelse(Sightings_survey$DISTANCE>2000 & !is.na(Sightings_survey$DISTANCE),
         2000,Sightings_survey$DISTANCE)

# Needs renaming to match the formula argument
Sightings_survey$distance <- Sightings_survey$DISTANCE
# Impute missing distances with the mean
Sightings_survey$distance[is.na(Sightings_survey$distance)] <- mean(Sightings_survey$distance,na.rm=T)
# Remove the old DISTANCE column
Sightings_survey <- Sightings_survey[,-c(8)]
# Rescaling to km
Sightings_survey$distance <- Sightings_survey$distance / 1000

# Plot the modified distances
ggplot(Sightings_survey@data, aes(x=distance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram perhaps shows an initial steady decline in sightings with distance, before decaying more quickly as distance increases. Again, we need to remember that this plot is showing associations between distance \(d\) and \(\lambda_{obs}\) and not with \(p(d)\). JOE: you need to explain this, I don’t understand why it’s not something close to \(p(d)\) and don’t fully understand how it links to the half-norm below. Is it because the p(d) is on the effort? I made some changes below assuming that this was the reason. I’m not sure when you mentioned this (the ‘again’ above implies that we have talked about this before)? In the lecture? To capture this shape, we will use a half-norm detection function:

\[p(d) = exp\left(\frac{-d^2}{2\sigma^2}\right)\]

This requires us to square the values of distance from the trackline. To do so, we need to first calculate the distance for the trackline, which will be our effort layer.

We can create a SpatialPixelsDataFrame containing the perpendicular distances to the nearest trackline. This fails to account for overlap in survey tracklines, but offers a reasonable first approach to crudely control for effort for the purposes of exploratory analysis.

Dist_Surveylines <- as(Dist_Brier,'SpatialPoints')
Dist_Surveylines$Dist_surveylines <- as.numeric(apply(gDistance(Dist_Surveylines, 
                                                                Effort_survey,byid = T),2,min))
Dist_Surveylines <- as(Dist_Surveylines,'SpatialPixelsDataFrame')

We can then map the distance to the nearest trackline. JOE: my understanding from the original (now cut) description in the previous paragraph, though I find the code above a bit confusing, is that it is combining the distance to port to the survey line distance. I’m confused why the below only shows the distance to trackline. Is the first line of code above just setting the format?

colsc <- function(...) {
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
                       limits = range(...))
}
# Map the distance through space
ggplot() + gg(Domain) + gg(Dist_Surveylines) + colsc(Dist_Surveylines@data[,1])

We can now square the survey line distance to work with the half-norm detection function.

Dist_Surveylines$Dist_surveylines <- Dist_Surveylines$Dist_surveylines^2

Exploring covariates for effort in whale-watch data

JOE: again this was the ppm section, but this is general info that is useful for your own models so we want to divorce it from the ppm stuff and put it here own its on. I think you also talk about it in lecture one, so it fits here very well.

Remember from lecture 1, when we do not have effort data, we can model it with covariates. Throughout our tutorials, we will use distance from port as a covariate of effor for the the whale-watch data.

Here, we investigate a potential functional form for the distance from port covariate. To do this, we plot a histogram of the distances from port at each of the whale watch sighting locations.

# Plot the sightings with distance from the port
hist(over(Sightings_Brier_nodup,Dist_Brier)$Dist_Brier, main='Histogram of the distance from port of the Brier sightings', xlab = 'Distance from port')

hist(over(Sightings_Quoddy_nodup,Dist_Quoddy)$Dist_Quoddy,breaks=20, main='Histogram of the distance from port of the Quoddy sightings', xlab = 'Distance from port')

For both ports, we detect a decreasing frequency of sightings made as the distance from port increases. This relationship is clearer for Brier sightings than for Quoddy sightings. Furthermore, the functional form of the effect does not appear to be an exponential decay. Based on the histograms, we choose to model the functional form as a half-normal function. More complicated functional forms (e.g. weibull or hazard functions) could be used.

More specifically, we assume throughout the same functional relationship between each whale-watch effort intensity \(\lambda_{eff, i}\) and distance from their port location \(s_i^*\). Here \(i\) indicates the port and \(s_i^*\) denotes the \(i^{th}\) port location. We scale the half-norm functions by a unique constant \(\lambda_i\), to capture the different number of vessels across the two ports: \[\lambda_{eff,i}(s) = \lambda_i exp\left(\frac{-||s-s_i^*||^2}{2\sigma_i^2} \right)\].

On the log scale, this covariate can be estimated as a linear effect on the squared distance from port. Thus, we create SpatialPixelsDataFrame objects which store the squared distances from port.

Dist_Brier_sq <- as(Dist_Brier,'SpatialPixels')
Dist_Brier_sq$Dist_Brier_sq <- Dist_Brier$Dist_Brier^2
Dist_Quoddy_sq <- as(Dist_Quoddy,'SpatialPixels')
Dist_Quoddy_sq$Dist_Quoddy_sq <- Dist_Quoddy$Dist_Quoddy^2

Exercises

If you got stuck on any of the exercises, then please feel free to try them again. Here are links to the problems:

  1. Plotting data by year

  2. Transforming spatial points and lines

Bonus Exercises

  1. Transform one of the spatial objects back into longitude latitude.
  2. Transform one of the spatial objects into units of meters.
  3. (More challenging) Investigate the relationship of distance from port with the two sources of whale watch data? What do you notice? What functional form could explain this type of behaviour? (Hint: use the over() function to extract the values of a spatial covariate (of type SpatialPixelsDataFrame) at point locations (stored as a SpatialPointsDataFrame object). Remember to subset the data accordingly!)

Additional tips and code

Using OpenStreetMaps instead of Google Maps

For publication, there can be issues regarding copyright of Google Maps. Using OpenStreetMap can help. To guarantee this simply add the following argument: source='osm', force=TRUE to gmap(). Double check the console that the maps are indeed being downloaded from stamen or osm. For brevity we have suppressed the messages.

gmap(Sightings_survey, source='osm', force=TRUE) +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')

Note for this to work it needs to be in the original project (lat/lon), not UTM.

The multiplot() function is a very flexible function that enables publication-quality figures to be made with relative ease.

Changing the layout of multiplot()

To change the order of the plots you can change the argument byrow=TRUE to byrow=FALSE.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(1:4, nrow=2, ncol=2, byrow = FALSE))

We can also change the size of the figures by changing the matrix in the layout.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))

As you can see plots of different size stretches the maps, to keep the set projection we can use coord_fixed(ratio = 1).

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
    coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))

Define your own colour palette

We can define the colour palette easily.

colsc <- function(...) {
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"PuBuGn")),
                       limits = range(...))
}

Look at ?RColorBrewer::brewer.pal to see what other colour palettes are available.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
  colsc(Bathym@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
  colsc(Dist_Brier@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
  colsc(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

Have a go at creating your own colour palette function. Investigate the effects of changing both arguments to brewer.pal.

colsc2 <- function(...){
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(7,"Spectral")),
                       limits = range(...))
}
multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
  colsc2(Bathym@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
  colsc2(Dist_Brier@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
  colsc2(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))

Potential repeated whale-watch and survey sightings in 2011

Could we not also put 2011’s whale-watch data into the training set? Below is some code showing that we cannot. A whale is sighted by at least one whale watch company on every day in 2011 that a survey detects a whale. The plot below shows that these sightings could have been of the same animal. Even if they weren’t, the tracklines in 2011 still visited areas that the whale watch vessels were present. Since our training and test datasets are required to be independent of each other, we must therefore also remove the 2011 whale watch sightings from the training data.

# what dates were survey sightings made on in 2011?
unique(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011])
## [1] "2011-06-21" "2011-08-12" "2011-08-13"
# Are sightings by the WW vessels made on these dates?
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][1],
     x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
         Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][2],
     x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
         Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][3],
     x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
         Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
# Could they be of the same animal? 
ggplot() + gg(Domain) + 
  gg(Sightings_survey_test[Sightings_survey_test$YEAR==2011,],colour='green') +
  gg.spatiallines_mod(Effort_survey_test[Effort_survey_test$YEAR==2011,],colour='yellow')

Acknowledgements

The code for hiding the Rmd code chunks came from Martin Schmelzer, found here

References to Data Sources:

References/Sources for data sets:

DFO Maritimes Region Whale Sightings Database

MacDonald, D., Emery, P., Themelis, D., Smedbol, R.K., Harris, L.E., and McCurdy, Q. 2017. Marine mammal and pelagic animal sightings (Whalesightings) database: a users guide. Can. Tech. Rep. Fish. Aquat. Sci. 3244: v + 44 p.

NOAA NARW Surveys

Timothy V.N. Cole. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA

2007 TNASS DFO Aerial Survey

Lawson. J.W., and Gosselin, J.-F. 2009. Distribution and preliminary abundance estimates for cetaceans seen during Canada’s marine megafauna survey - A component of the 2007 TNASS. DFO Can. Sci. Advis. Sec. Res. Doc. 2009/031. vi + 28 p.

NOAA Cetacean Surveys

Timothy V.N. Cole and D. Palka. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA


  1. Alternatively, we can use the function over() from the sp package. We choose to create im objects so that we can later use the ppm function from spatstat for exploratory analysis purposes. ppm fits Poisson process models, which can be equivalent to Maxent models.↩︎